Objective. Using Python, the project goal is to implement a k-means clustering algorithm, a technique often used in machine learning, and use it for data analysis. We write various functions making use of lists, sets, dictionaries, sorting, and graph data structures for computational problem solving and analysis.


Part 1. Spotify API Data

First, we create a Client Credentials Flow Manager used in server-to-server authentication by passing the necessary parameters to the Spotify OAuth class. We provide a client id and client secret to the constructor of this authorization flow, which does not require user interaction.

# Set client id and client secret
client_id = 'xxx'
client_secret = 'xxx'

# Spotify authentication
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

Part 1. The Algorithm

Objective. Design and implement a k-means clustering algorithm in Python.

Background

K-means clustering is a popular machine learning and data mining algorithm that discovers possible clusters within a dataset. Finding these clusters often reveals meaningful information from the data distribution. K-means clustering is used for a number of different applications, such as recognizing hand-written digits, which we will be implementing in this assignment.

K-means clustering works in four steps:

  1. Initialize some number \(k\) of cluster centers, also called centroids.
  2. For each data point in the dataset, assign it to the closest centroid.
  3. Update the locations of the centroids to be the average of all the points assigned to that cluster.
  4. Repeat steps 2 and 3 until convergence.

Note that the actual data points do not change. Only the locations of the centroids change with each iteration. And as the centroids move, the set containing the data points closest to each centroid alters. Below is an example of a single run of the K-means clustering.

Definitions

As with many machine learning techniques, this algorithm consists of a vast list of terminology which we define in a bit more detail below.

Definition 1. Distance The Euclidean distance, which indicates a straight line, is a simple way to calculate how close a data point is to a centroid. We calculate the euclidean distance using the Pythagorean theorem, defined as follows. For two points \(a = \left[a_1, a_2, \ldots, a_n\right]\) and \(b = \left[b_1, b_2, \ldots, b_n\right]\), where \(n\) is the current dimension, we define the euclidean distance between both points as

\[ \begin{align} \mathbf{\large\color{darkmagenta} D}(a, b) &= \sqrt{\left(a_1 - b_1\right)^2 + \left(a_2 - b_2\right)^2 + \ldots + \left(a_n - b_n\right)^2} \end{align} \]

Definition 2. Clusters A cluster is a collection of points that are part of the same group. For k-means, every point is part of a cluster. So as the algorithm progresses and the centroids shift, points might change which cluster they’re grouped in, even though the point itself does not move.

Definition 3. Centroids A centroid is the center of a cluster. In K-means, we assume that the centroid of a cluster is the average location of all the points in that cluster. This is equivalent to the average of the data points’ components in each dimension. So if we have three \(n\)-dimensional points \(a\), \(b\), and \(c\), we define the average as

\[ \mathit{average} = \left[ \dfrac{a_1 + b_1 + c_1}{3}, \dfrac{a_2 + b_2 + c_2}{3}, \dfrac{a_3 + b_3 + c_3}{3} \right] \]

Definition 4. Convergence An algorithm converges if the locations of all centroids do not change much between two iterations, e.g. within some threshold of \(1 \times 10^{-5}\).



References